a real, miniature GPT — with attention you can watch
The real architecture behind ChatGPT, shrunk until it fits in your browser: token + positional embeddings, a residual stream, stacked blocks of multi-head self-attention + a GELU MLP, LayerNorm, and a softmax head — trained live by genuine, gradient-checked backprop.
this build: {{ nLayers }} layers · {{ nHeads }} heads · {{ dDim }} dims · ctx {{ T }} · ~{{ paramCount }} weights. same shape as GPT-2, just smaller — so it babbles, but with real attention. the model lives in transformer-engine.js, a standalone file you can open and edit.
Backprop through {{ nLayers }} blocks of real attention. The loss — surprise at the true next letter — ticks down.
Each head learns its own pattern. Pick a layer and head, type a word, and watch every letter look back. Train and see them sharpen.
row = letter looking · column = letter looked at · lower triangle only: no peeking ahead.
Feed the model its own output, using the full context each step. Train it first; fresh weights spit noise.
“Look at the past” is just multiplying by a matrix. Step through how a plain average becomes learned attention — same shape, smarter weights.
{{ trickNote }}
Same architecture, every dial turned up — this lab already has the real pieces (multi-head, residual stream, LayerNorm, GELU).
Then train on the internet instead of names. The babble becomes language — but the dot products, softmax, residual stream, and gradient steps are exactly what you're running now.